Skip to content

fix(workspace): enforce isolation policy, surface degraded mode in TUI#249

Open
windoliver wants to merge 17 commits intomainfrom
worktree-temporal-whistling-sundae
Open

fix(workspace): enforce isolation policy, surface degraded mode in TUI#249
windoliver wants to merge 17 commits intomainfrom
worktree-temporal-whistling-sundae

Conversation

@windoliver
Copy link
Copy Markdown
Owner

Fixes #208.

Summary

  • Workspace provisioning failures in SessionOrchestrator.spawnAgent() and SpawnManager.spawn() were silently swallowed — agents ran in the shared project root with no error surfaced
  • Replaced blanket catch with policy-aware WorkspaceIsolationPolicy (strict | allow-fallback) and WorkspaceMode tagged union (isolated_worktree | fallback_workspace | bootstrap_failed)
  • TUI spawn-progress screen now shows a [shared workspace] / [no config] badge + inline ⚠ <reason> for degraded agents

Changes

Core

  • workspace-provisioner: fully async via execFileAsync (no event-loop blocking, no shell injection); parallel cleanup with Promise.allSettled; WorkspaceMode + WorkspaceIsolationPolicy exported from core/index
  • session-orchestrator: new provisionAgentWorkspace() helper with exhaustive policy-aware control flow; workspaceMode on AgentSessionInfo; default policy strict
  • session-service: threads workspaceIsolationPolicy through to orchestrator

TUI

  • spawn-manager: replaces inline execSync git worktree add with provisionWorkspace(); setIsolationPolicy(); workspaceMode on SpawnResult; default allow-fallback (backward compat)
  • spawn-progress: [shared workspace] / [no config] badge + ⚠ reason line for degraded agents
  • screen-manager: threads workspaceMode from SpawnResult into AgentSpawnState

Tests

  • 7 new workspace-bootstrap unit tests
  • 5 new policy enforcement tests in session-orchestrator
  • 2 new policy enforcement tests in spawn-manager
  • All existing spawn-success tests retrofitted with workspaceMode assertions
  • 3 tmux capture-pane E2E tests verifying on-screen badge visibility:
    • allow-fallback + non-git dir → ● started [shared workspace] + ⚠ Command failed: git worktree add ...
    • strict + non-git dir → ✗ failed: Workspace provisioning failed for '...'
    • allow-fallback + real git repo → ● started (no badge — happy path)

Full suite: 4898 pass, 0 fail.

Test plan

  • Run bun test — 4898 pass
  • Run bun test tests/tui/workspace-isolation-e2e.test.ts — 3 pass (requires tmux)
  • Start TUI against a non-git directory with default policy — agents show [shared workspace] badge
  • Start TUI against a real grove with git — agents show plain ● started

Fixes #208. Workspace provisioning failures in
SessionOrchestrator and SpawnManager were silently swallowed, causing
agents to run in a shared project root with no operator visibility.

Changes:
- workspace-provisioner: fully async via execFileAsync (no event-loop
  block, no shell injection); parallel cleanup with Promise.allSettled
- WorkspaceMode tagged union (isolated_worktree | fallback_workspace |
  bootstrap_failed) + WorkspaceIsolationPolicy (strict | allow-fallback)
  exported from core/index
- session-orchestrator: provisionAgentWorkspace() helper with exhaustive
  policy-aware control flow; workspaceMode on AgentSessionInfo; default
  policy strict
- spawn-manager: provisionWorkspace() replaces inline execSync git
  worktree; setIsolationPolicy(); workspaceMode on SpawnResult; default
  policy allow-fallback (TUI backward compat)
- session-service: threads workspaceIsolationPolicy through to orchestrator
- spawn-progress: [shared workspace] / [no config] badge + ⚠ reason line
  for degraded agents; screen-manager threads workspaceMode from SpawnResult
- Tests: workspace-bootstrap unit tests (7), policy enforcement tests in
  orchestrator (5) and spawn-manager (2), retrofitted all existing spawn
  tests with workspaceMode assertions
- E2E: tmux capture-pane test (3 scenarios) verifies badge visible on
  screen for fallback_workspace, hard failure for strict, clean for
  isolated_worktree
delegates/feeds/escalates edges now cause the target role's worktree to
branch off the source role's grove branch instead of HEAD.

  coder → reviewer (delegates)
    coder  worktree: grove/<sessionId>/coder   (base: HEAD)
    reviewer worktree: grove/<sessionId>/reviewer (base: grove/<sessionId>/coder)

This gives reviewers a real git branch relationship — they can run
`git merge grove/<sessionId>/coder` to get the coder's latest commits
without touching the main checkout.

Other edge types (reports, requests, feedback) remain independent (HEAD).

Changes:
- topology: WORKSPACE_BRANCH_EDGES constant, resolveRoleWorkspaceStrategies(),
  topologicalSortRoles() — Kahn's algorithm ensures source branches exist
  before dependents try to base their worktrees on them
- session-orchestrator: provision workspaces in topological order then spawn
  in parallel; passes resolved baseBranch to provisionAgentWorkspace()
- grove-md-builder: annotates edge_type lines with workspace behaviour comment
  so GROVE.md is self-documenting
- sqlite: worktree_strategy_json column on sessions table (migration + DDL);
  stores resolved strategies at creation time for operator visibility
- session: worktreeStrategies field on Session type
- tests: 13 new unit tests for resolveRoleWorkspaceStrategies and
  topologicalSortRoles covering all 6 edge types + chain ordering
SpawnManager now uses resolveRoleWorkspaceStrategies() to pick the correct
baseBranch per role based on topology edge types. ScreenManager passes the
resolved topology to SpawnManager before spawning begins.

Also fixes the branch-naming mismatch: spawn() now uses wsSessionId
(stable session-level ID) for provisionWorkspace so the branch computed
by resolveRoleWorkspaceStrategies matches the branch actually created for
the source role.

Validated via 3 tmux capture-pane scenarios:
  1. delegates + real git → both isolated_worktree, no badge
  2. feedback  + real git → both isolated_worktree, independent
  3. delegates + no git   → both fallback_workspace, [shared workspace] badge
The TUI screen-manager fired all role spawns concurrently, racing
coder and reviewer worktree provisioning. With delegates edges the
reviewer needs coder's git branch to exist first — concurrent spawns
caused reviewer to fail with "git worktree add ... branch not found"
and fall back to shared workspace.

Fix: import topologicalSortRoles and spawn sequentially — source roles
complete before dependents start. Validated via real TUI E2E:
  1. grove up → New Session → review-loop → goal → launch
  2. Spawn-progress shows: coder spawning → coder started → reviewer spawning → reviewer started
  3. git worktree list confirms both worktrees on correct branches
  4. Reviewer branch based on coder branch (delegates edge)
SessionOrchestrator hardcoded mcpServePath as join(projectRoot, "src/mcp/serve.ts")
which only works when the project IS the grove repo. For any other project
(e.g. /tmp/foo), agents got a non-existent MCP server path and couldn't
call grove_submit_work, grove_submit_review, or grove_done.

Fix: extract resolveMcpServePath() into shared utility that derives the
path from process.argv[1] (the running grove CLI entry point), climbing
3 levels to find the grove installation directory. Falls back through
dist/mcp/serve.js → src/mcp/serve.ts → import.meta.url variants.

DRYs up SpawnManager which had the same logic inline in two places.

Validated E2E: real claude agent via acpx in /tmp test project
successfully called grove_submit_work — contribution CID stored in
session_contributions with the correct session link.
…from edge type

Edge type describes communication semantics. Workspace strategy describes
file isolation. These are independent concerns — a reviewer might want a
delegates edge but still work from HEAD.

Before: WORKSPACE_BRANCH_EDGES constant implicitly derived workspace
strategy from edge type (delegates/feeds/escalates → branch_from_source).
No user control, surprise behavior when adding edges.

After: explicit opt-in field on each edge:
  edges:
    - target: reviewer
      edge_type: delegates
      workspace: branch_from_source   # optional, default: independent

- Removed WORKSPACE_BRANCH_EDGES constant
- resolveRoleWorkspaceStrategies checks edge.workspace, not edge type
- topologicalSortRoles checks edge.workspace for ordering
- Zod schema validates "branch_from_source" | "independent"
- Wire conversion passes workspace through
- review-loop preset updated with explicit workspace: branch_from_source
- Reverted grove-md-builder YAML comments (GROVE.md unchanged)
- grove-md-builder renders workspace field when present
…ediately

grove session start exited right after spawning agents — the reviewer
never got a chance to receive routed events from the coder's contribution.

Fix: SessionOrchestrator.waitForCompletion() polls agent status every 3s
and resolves when all agents are idle/stopped or timeout (5min) expires.
CLI awaits this before marking session complete and closing DB.

Validated: session now reaches "All agents idle — session complete" with
coder submitting work via grove_submit_work. Reviewer receives routed
events but doesn't call grove_submit_review (agent prompt issue, not
infrastructure).
Three bugs prevented the review-loop handoff from completing:

1. Missing .acpxrc.json in orchestrator path
   bootstrapWorkspace only wrote .mcp.json, but acpx reads .acpxrc.json.
   Without it, agents had zero MCP tools — couldn't call grove_submit_work
   or grove_submit_review. Fix: write .acpxrc.json alongside .mcp.json.

2. Duplicate prompts
   orchestrator.start() sent goals via runtime.send() AFTER spawn() already
   sent them as initial prompt. Reviewer got the same prompt twice. Fix:
   remove the redundant send loop — spawn passes goals via AgentConfig.goal.

3. Premature session stop
   checkAllIdle() stopped the session when both agents went idle, even if
   no contributions existed yet. Coder goes idle briefly between editing
   and calling grove_submit_work. Fix: 30s grace period — don't auto-stop
   until at least one contribution exists or 30s have passed.

Validated E2E with real claude agents via acpx:
  coder  → work: "Added greet(name) function"
  reviewer → review: "LGTM — greet(name) correctly returns..."
  reviewer → discussion: "[DONE] Approved — implementation is correct"
Edge types simplified to 3 core types:
  delegates — forward work (A produces, B acts on it)
  feedback  — response (B sends results back to A)
  monitors  — observe-only (A watches B)

Legacy aliases (reports, feeds, requests, escalates) kept in schema for
backward compat but all presets migrated to use core types only.

Routing message now includes source branch and merge instructions so the
receiving agent can actually see file changes:
  "To see the actual file changes, run: git merge grove/<session>/<source>"

CLAUDE.md updated with workflow steps:
  1. git merge <source_branch> to get their files
  2. Review the code
  3. Call grove_submit_review with targetCid from notification
Issue A — git commit as work product:
  grove_submit_work now accepts commitHash (git SHA) instead of requiring
  CAS-per-file artifacts. Agents just commit and pass the hash. Other
  agents git fetch + git diff to see changes. Old artifacts path still
  works for backward compat.

  Changed: models.ts (commitHash field), contributions.ts (MCP schema),
  contribute.ts (skip CAS validation when commitHash present),
  acpx-runtime.ts (system-reminder uses git workflow),
  workspace-bootstrap.ts (CLAUDE.md teaches git-based workflow)

Issue B — broadcast topology mode:
  Roles can set mode: "broadcast" to notify all other roles without
  explicit edges. No more N×(N-1) edge definitions for swarms.

  review-loop preset: coder and reviewer both use mode: "broadcast"
  instead of explicit delegates/feedback edges.

  Changed: topology.ts (mode field on role schema + wire type),
  topology-router.ts (broadcast routes to all roles),
  grove-md-builder.ts (renders mode), review-loop.ts preset

Edge types (delegates/feedback/monitors) kept in schema for backward
compat but no longer required when using broadcast mode.
…oundary

MCP tools run in child processes (spawned by acpx/claude) with separate
EventBus instances. Events published from grove_submit_work in the MCP
process never reached the orchestrator's EventBus in the parent process.

Fix: SessionOrchestrator polls the SQLite contribution store every 3s,
detects new CIDs via a seen-set, and forwards them to downstream agents
via runtime.send(). This is the same pattern the TUI uses via
SpawnManager.startContributionPolling().

Validated: reviewer log shows the routed contribution message arriving
with CID, summary, and source branch — the polling successfully bridges
the process boundary.
…gent turns

Contributions from fast agents (coder) were arriving in the same acpx
session turn as the reviewer's initial prompt — the agent treated them
as context instead of an action trigger and went idle.

Fix: seed seenCids with pre-existing contributions, delay first poll by
15s so agents process their initial prompt first. Skip EventBus-based
contribution forwarding entirely (polling handles it, EventBus can't
cross process boundaries anyway).

Validated E2E — full coder→reviewer handoff:
  work(coder):       Added greet(name) function
  review(reviewer):  LGTM — clean implementation
  work(coder):       Added greet(name) function (iteration)
  discussion(reviewer): [DONE] Approved — implementation meets standards
Reviewer now gets the source agent's workspace path in the notification
instead of a git branch name. No git merge needed — the reviewer reads
files directly from the source workspace path:

  [grove] New work from coder:
    Workspace: /tmp/project/.grove/workspaces/coder-abc/

  ACTION REQUIRED: Read the source files at /tmp/.../coder-abc/

The filesystem IS the handoff artifact. Both agents are on the same
machine — the reviewer just reads the coder's directory.

Updated CLAUDE.md and acpx system-reminder to teach agents the
path-based workflow instead of git merge.
pollContributions now checks for [DONE] prefix or context.done=true
in contributions. When detected, stops the session with a descriptive
reason. This mirrors use-done-detection.ts from the TUI layer.

Combined with the CLI's markDone(), the session status in SQLite
transitions to "completed" with the proper stop reason.
Coder prompt: git commit → git rev-parse HEAD → grove_submit_work({ commitHash })
Reviewer prompt: read files at Workspace path from notification, not summary

Validated E2E — full coder→reviewer handoff with real file inspection:
  1. Coder edits app.js, commits, submits work
  2. Polling routes notification with workspace path to reviewer
  3. Reviewer reads coder's actual files (Read /path/to/coder/app.js)
  4. Reviewer submits review with real code assessment
  5. Reviewer calls grove_done → session status=completed
  6. Project root untouched, worktrees isolated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: do not silently fall back to shared workspace when agent worktree/bootstrap fails

1 participant